-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Bubble up errors from HARouting unless using StandbyFallbackException #7238
fix: Bubble up errors from HARouting unless using StandbyFallbackException #7238
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a first pass on non-testing code. I'm wondering if we are going to capture/swallow the new type of exceptions anyways, could we just let the executeOrRouteQuery
to return something for the caller that captures standby fallback
instead of the exception throwing path?
pullQueryMetrics | ||
.ifPresent(queryExecutorMetrics -> queryExecutorMetrics.recordRemoteRequests(1)); | ||
forwardTo(node, locations, statement, serviceContext, pullQueryQueue, rowFactory, | ||
outputSchema); | ||
} catch (StandbyFallbackException e) { | ||
LOG.error("Standby Fallback: Error forwarding query {} to node {} with exception {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to log it as error really, since we are about to retry on the standby right?
Also if we intend to expose this to users, they may not really understand what standby fallback
means. What about just "Failed to execute query {} on node {}, will try to switch to execute on the standby query state which may result in stale results"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that error
is maybe too severe. It's still not perfect operation, so I think warn
is most appropriate.
This error is only meant to show up in the logs and not be surfaced via the CLI, but I agree it's a bit unclear. I changed it to "Error executing query {} locally at node {}. Falling back to standby state which may return stale results"
} catch (StandbyFallbackException e) { | ||
LOG.error("Standby Fallback: Error executing query {} locally at node {} with exception", | ||
statement.getStatementText(), node, e.getCause()); | ||
throw e; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} catch (StandbyFallbackException e) { | ||
LOG.error("Standby Fallback: Error forwarding query {} to node {} with exception {}", | ||
statement.getStatementText(), node, e.getCause()); | ||
throw e; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since on the caller we are going to swallow the StandbyFallbackException
and retry on standbys, how about letting executeOrRouteQuery
to return a result (for now, just boolean is okay; in general, maybe a enum can do more things) where the caller can check on the futures
, instead of capturing and opening the exception envelope to check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, good suggestion. I just created an enum to make it a little easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
LGTM! |
… not being caught if it was non network
Description
HARouting
has previously treated all Exceptions the same and used them as reason to retry with standbys. The issue is that certain errors are deterministic and often caused by some form of user error or code bug. Not only is there no use in retrying something that will have the same effect on a standby, but if we treat it similar to a transient error and print the error messageUnable to execute pull query: "%s". Exhausted standby hosts to try.
, it's rather confusing since trying again won't fix the issue.The goal of this PR is to surface any bug by requiring any standby retries to explicitly use
StandbyFallbackException
and by bubbling up other exceptions by default. In particular, any network error gets wrapped in aStandbyFallbackException
and everything else, including local lookup errors, is surfaced.Fixes #6799
Testing done
I rewrote much of
HARoutingTest
since it wasn't well tested before and stopped at the routing step. It's now tested all the way down to either the network call or execution at the physical plan layer.Reviewer checklist